1  Introduction

This document describes the mechanism for performing input and output on the Tera File System.
In particular, we discuss the functional characteristics of the read(), write(), mmap(), and munmap()
system calls as they are implemented on Tera.

2  Read and Write

A read() system call reads a specified number of bytes from a file's data blocks into a buffer
in the user's virtual address space. Similarly, a write() system call writes a specified number of
bytes from a buffer in the user's virtual address space into a file's data blocks.

Currently, there are two different approaches to implementing read() and write(). We discuss and
evaluate them in the following sections.

2.1  Implementation  1 

A conventional way to implement fread() and formatted reads involves a call to a library routine
that dispenses data from its own internal buffer until it runs out, at which time it makes a system
call, sys_read(), to the file system. The file system moves, via uiomove(), a specified amount of data
from its buffer cache into the library buffer. This continues until the requested amount of data has
been read into a user-supplied buffer. In this scenario, the size of the library buffer determines the
frequency of system calls made to the file system.

Using fread() as an example, the following pseudocode illustrates a conventional Unix implementation
of fread(). Note that this example is not intended to be the actual Unix code. This pseudocode
can be made thread-safe by parallelizing accesses to the library buffer. A barrier must be provided
for all executing threads each time new data is read into the library buffer.


user code:

    FILE *stream;
    char *buf;
    int len;

    stream = fopen(filename, "r");
    error = fread(buf, len, 1, stream);

library code:

    fread(stream, buf, len) {
        int size;

        char *libbuf = stream->_p;      /* library buffer for stream */
        int resid = stream->_r;         /* data not yet read in libbuf */

        while (len) {                   /* loop till actual len is read */

            if (!resid) {               /* if no more data in libbuf */

                /* reinitialize libbuf base address */
                libbuf = stream->_p = stream->_sbuf.base;

                /* call file system to read data into libbuf */
                resid = sys_read();
            }

            /* bytes to copy this pass */
            size = (len < resid) ? len : resid;

            /* copy data from libbuf to user buf */
            bcopy(libbuf, buf, size);

            libbuf += size;             /* increment libbuf ptr */
            /* calculate what has not been read in libbuf */
            resid -= size;
            len -= size;
        }

        stream->_p = libbuf;            /* update ptr to libbuf */
    }

system code:

    sys_read() {

        lock_vnode();

        /* do file system and device dependent read */
        VOP_READ();

        uiomove();                      /* a copy to libbuf */

        fd->f_offset += len;            /* update offset in file descriptor table */

        *retval = len;
        unlock_vnode();
    }
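
The effect of the library buffer size on system call frequency can be checked with a small runnable sketch. This is not the Tera or Unix library code: the names bstream, bs_init, bs_read, and the nsys counter are hypothetical, the 4096-byte buffer size is an assumption, and the POSIX read() call stands in for sys_read().

```c
#include <string.h>
#include <unistd.h>

#define LIBBUF_SIZE 4096        /* assumed library buffer size */

/* Hypothetical stand-in for the FILE structure used in the pseudocode. */
struct bstream {
    int    fd;                  /* underlying file descriptor */
    char   base[LIBBUF_SIZE];   /* library buffer (libbuf) */
    char  *p;                   /* next unread byte in libbuf */
    size_t resid;               /* bytes in libbuf not yet handed out */
    long   nsys;                /* number of read() system calls made */
};

void bs_init(struct bstream *s, int fd)
{
    s->fd = fd;
    s->p = s->base;
    s->resid = 0;
    s->nsys = 0;
}

/* Copy up to len bytes into buf; returns the number of bytes copied. */
size_t bs_read(struct bstream *s, char *buf, size_t len)
{
    size_t done = 0;

    while (len > 0) {
        if (s->resid == 0) {                    /* libbuf is empty: refill */
            ssize_t n = read(s->fd, s->base, LIBBUF_SIZE);
            s->nsys++;
            if (n <= 0)                         /* end of file or error */
                break;
            s->p = s->base;
            s->resid = (size_t)n;
        }
        size_t size = len < s->resid ? len : s->resid;
        memcpy(buf + done, s->p, size);         /* libbuf -> user buffer */
        s->p += size;
        s->resid -= size;
        len -= size;
        done += size;
    }
    return done;
}
```

Many small bs_read() calls are served from the buffer, so the nsys counter grows with the amount of data divided by LIBBUF_SIZE rather than with the number of calls.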

2.2  Implementation 2


The mmap() system call with a PROT_SHARED option has been proposed as a possible alternative
for performing user read() and write(). This implementation is very similar to the previous one,
except that data resides in shared buffer cache memory instead of an internal library buffer area. First,
munmap() is called to ensure that any previously mmapped segment is unmapped from
the specified virtual address range. Then, mmap() is called to map a block of data less than or
equal to the size of a file's data block into the task's virtual address space.

Note that by using mmap() as the mechanism for all reads and writes, the maximum number of open
file descriptors will be restricted by the maximum number of shared mmapped data blocks allowable
per task. Since the limit on open file descriptors is usually larger than the limit on
shared mmapped data blocks, this is not a desirable dependency.

Using fread() again as an example, the following pseudocode shows how mmap() is used for implementing
fread(). Parallelization of this code is similar to that of a conventional implementation.


user code:

    FILE *stream;
    char *buf;
    int len;

    stream = fopen(filename, "r");
    error = fread(buf, len, 1, stream);

library code:

    fread(stream, buf, len) {
        int size;

        char *mmapbuf = stream->_p;     /* mmap buffer for stream */
        int resid = stream->_r;         /* data not yet read in mmapbuf */

        while (len) {                   /* loop till actual len is read */

            if (!resid) {               /* if no more data in mmapbuf */

                /* first, unmap current buffer */
                munmap(stream->_sbuf.base);

                /* mmap as shared */
                mmapbuf = mmap(0, &resid, prot, flag, stream->_file);

                /* reinitialize mmap buffer base address */
                stream->_p = stream->_sbuf.base = mmapbuf;
            }

            /* bytes to copy this pass */
            size = (len < resid) ? len : resid;

            /* copy data from mmapbuf to user buf */
            bcopy(mmapbuf, buf, size);

            mmapbuf += size;            /* increment mmapbuf ptr */

            /* calculate what has not been read in mmapbuf */
            resid -= size;
            len -= size;
        }

        stream->_p = mmapbuf;           /* update stream ptr */
    }

system code:

    mmap() {

        lock_vnode();

        /* do file system and device dependent read */
        VOP_READ();

        pa = vm_mmap();                 /* mmap to user's addr space
                                           & increment reference count */

        unlock_vnode();

        *retval = pa;
    }

    munmap() {

        vm_munmap(pa);                  /* unmap from user's addr space
                                           & decrement reference count */
    }
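
For comparison, the same map, copy, unmap cycle can be sketched against the standard POSIX interface. This is only an illustration under stated assumptions, not the Tera implementation: map_read is a hypothetical helper, the system page size stands in for the file block size, and POSIX spells the sharing flag MAP_SHARED and takes the length by value.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read len bytes at offset off from fd into buf by
 * mapping one window at a time. Returns bytes copied, or -1 on error.
 */
ssize_t map_read(int fd, off_t off, char *buf, size_t len)
{
    struct stat st;
    long page = sysconf(_SC_PAGESIZE);   /* assumed window ("block") size */
    size_t done = 0;

    if (page <= 0 || fstat(fd, &st) < 0)
        return -1;

    while (len > 0 && off < st.st_size) {
        off_t  start = off - (off % page);       /* aligned window start */
        size_t skip  = (size_t)(off - start);    /* offset inside the window */
        size_t avail = (size_t)page - skip;      /* bytes left in this window */
        off_t  left  = st.st_size - off;         /* bytes left in the file */
        size_t size  = len < avail ? len : avail;

        if ((off_t)size > left)
            size = (size_t)left;

        /* map one window of the file as shared */
        char *win = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, start);
        if (win == MAP_FAILED)
            return -1;

        memcpy(buf + done, win + skip, size);    /* the only data copy */

        /* unmap before moving to the next window */
        if (munmap(win, (size_t)page) < 0)
            return -1;

        off  += (off_t)size;
        len  -= size;
        done += size;
    }
    return (ssize_t)done;
}
```

Each window costs two system calls (mmap and munmap), and the mapping is held only for the duration of the copy.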

2.3  Evaluations of Implementations

Table 1 compares these two approaches. The primary advantage of mmap is that it saves a data copy
from the system to the library buffer. However, at first glance it is not clear what effect the mmap()
implementation will have on the global performance of the system, particularly when the time between
usr_read()s can be large, thus tying up valuable file system buffer cache between an mmap()
and its munmap(). Our plan is to stay with the conventional implementation. In the future,
we will use performance studies to explore the global effects of an mmap implementation.


mmap                              sys_call

no copy to libbuf                 copy from system to libbuf
invoke 0 to 2 syscalls            invoke 0 to 1 syscall
buffer cache tied up by user      buffer cache not controlled by user
max fd = max mmapbufs             no relation between max fd and mmapbufs
access 1 FS blk per syscall       access 1 or more FS blks per syscall

Table 1: Compare Read and Write Using Mmap or Sys_Call


3  Mmap and Munmap

An mmap() system call allows a user to share a file's data that resides in the file system's buffer
cache by directly mapping the buffer cache block into a task's virtual address space. The advantages
of sharing data between the operating system and its users are to eliminate extra copying between
virtual address spaces, and to provide a means for synchronization between tasks of different address
spaces. An munmap() system call removes the mapping of part or all of the mmapped block from
a task's address space. Except for specifics discussed in this section, mmap() and munmap() are
intended to be SVID 3 compliant.

This document primarily describes mmap() as it relates to data shared using the file system's name
space. For example, we will not be discussing anonymous mmap() for implementation of dynamic
shared memory with no persistent store.

3.1  Characteristics  of  Mmap 

A file must be opened prior to its data being mmapped into a user's address space.

Table 2 shows all valid combinations of flags specified in an open() call and the corresponding
protection flags in mmap().

Other  general  characteristics  of  mmap()  that  are  worth  mentioning  include: 

1.  the offset of a file descriptor is not affected after an mmap() call

2.  reference count on an mmapped buffer is incremented after a fork() system call and is
decremented after an exec() system call

3.  SVR4 allows mmapping over an address range that is already mmapped. This essentially
performs an munmap() of the old segment and an mmap() of the new segment in the same
address range. Because there is potential for hiding programming errors, currently we are
inclined to be stricter in our functionality. On Tera an application must first unmap an
existing mapped segment before another physical segment can be mapped within the same
address range. Otherwise, an error (EADDRINUSE) will be returned to its caller.
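
The stricter rule in item 3 can be modeled with a small bookkeeping sketch. It is not Tera kernel code: strict_map, strict_unmap, the seg table, and the MAX_SEGS limit are hypothetical names and values used only to show a map request being rejected with EADDRINUSE until the overlapping range has been unmapped.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_SEGS 16             /* assumed per-task limit, for illustration */

/* Hypothetical record of one mapped virtual range [addr, addr + len). */
struct seg {
    size_t addr;
    size_t len;
    int    used;
};

static struct seg segs[MAX_SEGS];

/* Returns 0 on success, EADDRINUSE on overlap, ENOMEM if the table is full. */
int strict_map(size_t addr, size_t len)
{
    int free_slot = -1;

    for (int i = 0; i < MAX_SEGS; i++) {
        if (!segs[i].used) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* two half-open ranges overlap if each starts before the other ends */
        if (addr < segs[i].addr + segs[i].len && segs[i].addr < addr + len)
            return EADDRINUSE;
    }
    if (free_slot < 0)
        return ENOMEM;

    segs[free_slot] = (struct seg){ addr, len, 1 };
    return 0;
}

/* Returns 0 if a mapping starting at addr was removed, EINVAL otherwise. */
int strict_unmap(size_t addr)
{
    for (int i = 0; i < MAX_SEGS; i++) {
        if (segs[i].used && segs[i].addr == addr) {
            segs[i].used = 0;
            return 0;
        }
    }
    return EINVAL;
}
```

Under this model a second map over an occupied range fails, and the same request succeeds once the caller has explicitly unmapped the old segment.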


open flag              mmap prot flag    Return

O_RDONLY               PROT_READ         OK
O_RDONLY               PROT_WRITE        EACCES
O_WRONLY               PROT_READ         EACCES
O_WRONLY               PROT_WRITE        OK
O_RDWR                 PROT_READ         OK
O_RDWR                 PROT_WRITE        OK
O_RDWR|O_CREAT         PROT_READ         OK
O_RDWR|O_CREAT         PROT_WRITE        OK
O_RDWR|O_APPEND        PROT_READ         OK
O_RDWR|O_APPEND        PROT_WRITE        OK

Table 2: Validity of Open and Mmap Flags
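
The rule behind Table 2 can be expressed as a short check on the access mode of the open flags. The function name mmap_prot_check is hypothetical, and treating a combined PROT_READ and PROT_WRITE request as requiring both permissions is an assumption, since the table lists each protection flag separately.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>

/* Returns 0 if prot is permitted for a file opened with oflag, else EACCES. */
int mmap_prot_check(int oflag, int prot)
{
    int acc = oflag & O_ACCMODE;    /* isolate O_RDONLY, O_WRONLY or O_RDWR */

    if ((prot & PROT_READ) && acc == O_WRONLY)
        return EACCES;              /* reading needs read access */
    if ((prot & PROT_WRITE) && acc == O_RDONLY)
        return EACCES;              /* writing needs write access */
    return 0;
}
```

Flags such as O_CREAT and O_APPEND do not change the result, because only the access mode bits are examined.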

3.2  Mmap on Tera


Architectural differences between the memory models of the Tera Computer System [1] and most
conventional page based systems are manifested in the mmap() system call.

Table  3  lists  some  of  the  major  differences  of  mmap  in  Tera  versus  those  in  other  page  based 
systems. 

The Tera memory system supports a segment oriented virtual memory. A physical segment can
vary in size from 1K to 32M 8-byte words. Since a file's logical data blocks are not guaranteed to
be physically contiguous within the buffer cache, contiguous virtual addresses cannot be ensured
across data block boundaries. Therefore, for simplicity our current implementation of mmap()
with the PROT_SHARED flag allows at most one file block of data to be mapped at any one time to
a segment.

In addition, Tera's physical memory is separated into program and data memory. Program
memory is not writeable by a user level thread. Therefore a protection of PROT_EXEC for mmap()
has been eliminated from the Tera specification.

3.3  Extensions  for  Tera 

According to SVID 3, mmap() cannot write beyond an existing end of file. On Tera, we have
extended mmap() to map to newly created data blocks for files that are opened with write permission.
SVID 3 specifies that the protection option of PROT_WRITE is defined as PROT_READ and
PROT_WRITE. However, Tera provides write-only access to memory. Therefore, the protection
option of PROT_WRITE is defined as write-only.

To allow a user to have better control over the actual mapping of one file block at a time, we have
extended the mmap() semantics by creating a new system call, known as mmap_fsblk().


    pa = caddr_t mmap_fsblk(
             caddr_t addr,
             int *len,
             int prot,
             int flags,
             int fd,
             off_t off);


Page Based Systems                      Tera

Page oriented                           Segment oriented
Execute in data pages                   No execute in data segments
Map across fixed phys page boundary     No map across variable phys segment boundary


Table 3: Compare Mmap on Tera vs on Page Based Systems

The major difference between mmap() and mmap_fsblk() is the parameter len. In mmap_fsblk(), len
is a pointer used to return, in bytes, the actual size of the segment mapped. Len is calculated as follows:


    int lblkno = (off + f_blksize) / f_blksize;
    *len = (lblkno * f_blksize - off);

where: 

    lblkno is the logical block number where offset (off) lies
    and ranges from 1 on.
    f_blksize is the data block size of the file
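
As a worked check of the formula, the following sketch computes len for a few offsets and counts how many one-block mappings are needed to cover a byte range. The helper names fsblk_len and fsblk_count are hypothetical, and the 4096-byte block size in the usage note is an assumed value, not a Tera parameter.

```c
/* Bytes from off to the end of the file block that contains off. */
long fsblk_len(long off, long f_blksize)
{
    long lblkno = (off + f_blksize) / f_blksize;   /* 1-based logical block */
    return lblkno * f_blksize - off;
}

/* Number of one-block mappings needed to cover n bytes starting at off. */
long fsblk_count(long off, long n, long f_blksize)
{
    long count = 0;

    while (n > 0) {
        long len  = fsblk_len(off, f_blksize);     /* size of this mapping */
        long take = n < len ? n : len;             /* bytes used from it */
        off += take;
        n   -= take;
        count++;
    }
    return count;
}
```

With f_blksize = 4096, an offset of 100 gives lblkno = 1 and len = 3996, while an offset of 5000 gives lblkno = 2 and len = 3192. Covering 10000 bytes from offset 100 therefore takes three mappings.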